Point and interval estimates
Eva Freyhult
NBIS, SciLifeLab
2022-09-13
The sample proportion and sample mean are unbiased estimates of the population proportion and population mean.
The sample estimate is our best guess, but it will not be without error.
Pollen example
If we are interested in what proportion of the Uppsala population is allergic to pollen, we can investigate this by studying a random sample. We randomly select 100 persons in Uppsala and observe that 42 have a pollen allergy.
Based on this observation, our point estimate of the Uppsala population proportion \(\pi\) is \(\pi \approx p = 0.42\).
We know that there is some uncertainty in this measurement: if the experiment were repeated, we would select 100 other persons and our point estimate would be slightly different.
Using bootstrap we can sample with replacement from our sample to estimate the uncertainty.
Bootstrapping means taking the data we have (our sample) and repeatedly sampling from it with replacement.
Put the entire sample in an urn!
Sample from the urn with replacement to compute the bootstrap distribution.
Sample a ball with replacement 100 times and note the proportion allergic (black balls).
Repeat this many times to get a bootstrap distribution.
Using the bootstrap distribution the uncertainty of our estimate of \(\pi\) can be estimated.
The 95% bootstrap interval is [0.32, 0.52].
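The urn procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number of bootstrap replicates (10,000), the random seed, and the use of a simple percentile interval are choices made here, not details given in the slides, so the resulting limits may differ slightly from the interval quoted above.

```python
import random

random.seed(1)  # arbitrary seed, only so the sketch is reproducible

# The observed sample as an urn: 42 allergic (1) and 58 non-allergic (0) persons.
sample = [1] * 42 + [0] * 58
n = len(sample)
n_boot = 10_000  # number of bootstrap replicates (an assumption, not from the text)

# Each replicate: draw n balls with replacement and note the proportion allergic.
boot_props = []
for _ in range(n_boot):
    resample = random.choices(sample, k=n)
    boot_props.append(sum(resample) / n)

# Percentile interval: roughly the 2.5% and 97.5% quantiles of the bootstrap distribution.
boot_props.sort()
lower = boot_props[round(0.025 * n_boot)]
upper = boot_props[round(0.975 * n_boot) - 1]
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

Because the resampling is random, repeated runs with different seeds give slightly different limits; more replicates make the interval more stable.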
The bootstrap is very useful if you do not know the distribution of the sampled property. But in our example we actually do.
A confidence interval is a type of interval estimate associated with a confidence level.
An interval that covers the population parameter \(\theta\) with probability \(1 - \alpha\) is called a confidence interval for \(\theta\) with confidence level \(1 - \alpha\).
Remember that we can use the central limit theorem to show that
\[P \sim N\left(\pi, SE\right) \iff P \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)\]
It follows that
\[Z = \frac{P - \pi}{SE} \sim N(0,1)\] Based on what we know of the standard normal distribution, we can compute an interval around the population property \(\pi\) such that the probability that a sample property \(p\) falls within this interval is \(1-\alpha\).
\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]
\(z_{\alpha/2}\) is the value such that \(P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2} \iff P(Z \leq z_{\alpha/2}) = 1 - \frac{\alpha}{2}\).
For 95% confidence, \(\alpha = 0.05\) and \(z_{\alpha/2} = z_{0.025} = 1.96\). For 90% and 99% confidence, \(z_{0.05} = 1.64\) and \(z_{0.005}=2.58\), respectively.
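As a quick check of these critical values, the sketch below computes \(z_{\alpha/2}\) as the \(1 - \alpha/2\) quantile of the standard normal distribution using Python's standard library. The helper name `z_critical` is made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_critical(confidence):
    """Return z_{alpha/2} for a two-sided interval with the given confidence level."""
    alpha = 1.0 - confidence
    # z_{alpha/2} satisfies P(Z <= z_{alpha/2}) = 1 - alpha/2.
    return std_normal.inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.2f}")
```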
\[ \begin{aligned} P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) &= 1-\alpha \\ P\left(-z_{\alpha/2} < \frac{P - \pi}{SE} < z_{\alpha/2}\right) &= 1 - \alpha \end{aligned} \]
We can rewrite this as
\[P\left(\pi-z_{\alpha/2} SE < P < \pi + z_{\alpha/2} SE\right) = 1-\alpha\] In words, a sample fraction \(p\) will fall within \(\pi \pm z_{\alpha/2} SE\) with probability \(1- \alpha\).
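Rearranging the inequalities so that \(\pi\) is isolated in the middle, and writing \(z\) for \(z_{\alpha/2}\), the equation can also be rewritten as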
\[P\left(P-z SE < \pi < P + z SE\right) = 1 - \alpha\]
The observed confidence interval is what we get when we replace the random variable \(P\) with our observed fraction \(p\) (also in \(SE\), where the unknown \(\pi\) is estimated by \(p\)),
\[p-z SE < \pi < p + z SE\] \[\pi = p \pm z SE = p \pm z \sqrt{\frac{p(1-p)}{n}}\]
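The 95% confidence interval is \[\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]
An interval constructed this way has a 95% chance of covering the true value: if the sampling were repeated many times, about 95% of the resulting intervals would contain \(\pi\).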
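The coverage interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical true proportion of 0.42 (chosen only so the simulation has a known answer), repeats the study many times, and counts how often the interval \(p \pm 1.96\,SE\) contains that value. The number of simulated studies and the seed are arbitrary. Because the interval relies on a normal approximation, the simulated coverage is expected to be close to, but not exactly, 95%.

```python
import math
import random

random.seed(2)  # arbitrary seed for reproducibility

true_pi = 0.42  # hypothetical "true" proportion, chosen only for this simulation
n = 100         # sample size, as in the pollen example
z = 1.96        # z_{alpha/2} for 95% confidence
n_sim = 5_000   # number of simulated studies (an assumption)

covered = 0
for _ in range(n_sim):
    # Simulate one study: count allergic persons among n independent draws.
    x = sum(random.random() < true_pi for _ in range(n))
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    # Check whether this study's interval covers the true proportion.
    if p - z * se < true_pi < p + z * se:
        covered += 1

coverage = covered / n_sim
print(f"Proportion of intervals covering pi: {coverage:.3f}")
```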
Back to our example of the proportion of pollen allergic persons in Uppsala: \(p=0.42\) and \(SE=\sqrt{\frac{p(1-p)}{n}} = 0.0494\).
Hence, the 95% confidence interval is \[\pi = 0.42 \pm 1.96 \cdot 0.0494 = 0.42 \pm 0.097\] or \[(0.42-0.097, 0.42+0.097) = (0.32, 0.52)\]
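The arithmetic of this interval can be reproduced directly from \(n = 100\), 42 allergic persons and \(z = 1.96\). The variable names in this minimal sketch are illustrative only.

```python
import math

n = 100     # sample size
p = 42 / n  # observed proportion allergic
z = 1.96    # z_{alpha/2} for 95% confidence

se = math.sqrt(p * (1 - p) / n)  # estimated standard error of the proportion
margin = z * se                  # half-width of the confidence interval
lower, upper = p - margin, p + margin

print(f"SE = {se:.4f}, margin = {margin:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```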
The mean of a sample of \(n\) independent and identically normally distributed observations \(X_i\) is normally distributed:
\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
If \(\sigma\) is unknown, the statistic
\[\frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} \sim t(n-1)\] is t-distributed with \(n-1\) degrees of freedom.
It follows that
\[ \begin{aligned} P\left(-t < \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ P\left(\bar X - t \frac{s}{\sqrt{n}} < \mu < \bar X + t \frac{s}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} \]
The confidence interval with confidence level \(1-\alpha\) is thus
\[\mu = \bar x \pm t \frac{s}{\sqrt{n}}\]
For a 95% confidence interval and \(n=5\) (that is, 4 degrees of freedom), \(t = 2.7764\).
The \(t\) values for different values of \(\alpha\) and degrees of freedom are tabulated and can be computed in R using the function qt.
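As a small worked illustration of the interval \(\bar x \pm t \frac{s}{\sqrt{n}}\), the Python sketch below uses five invented measurements (hypothetical data, not from the slides) together with the tabulated value \(t = 2.7764\) for 4 degrees of freedom. The \(t\) value is hard-coded because Python's standard library has no t-quantile function. For other sample sizes or confidence levels it would have to be looked up again, for example with qt in R.

```python
import math
import statistics

# Hypothetical measurements (n = 5), invented purely for illustration.
x = [4.1, 5.3, 4.8, 5.9, 5.0]
n = len(x)

t = 2.7764  # tabulated t value for 95% confidence and n - 1 = 4 degrees of freedom

mean = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation (n - 1 in the denominator)
sem = s / math.sqrt(n)   # standard error of the mean

lower, upper = mean - t * sem, mean + t * sem
print(f"mean = {mean:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```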